Skip to content

Conversation

@praneshgo
Copy link
Contributor

Description

  • Introduced external GPU synchronization primitives enabling interoperability with Direct3D 12 and Vulkan graphics APIs.
  • Added APIs to acquire, wait on, and signal external GPU synchronization fences or semaphores in execution providers.
  • Enabled setup of CUDA Interop Graphics (CIG) contexts to synchronize CUDA streams with graphics API command queues/devices.
  • Extended NVIDIA TensorRT RTX execution provider with support for importing and synchronizing external semaphores from graphics APIs.
  • Added execution provider interface methods for managing external synchronization primitives and interop synchronization.
  • Added build options to enable DirectX and Vulkan interoperability features.
  • Added comprehensive GPU interoperability test validating D3D12 and ONNX Runtime synchronization with TensorRT RTX.

Motivation and Context

This commit largely focuses on adding the design keeping in mind sample apps using either DX or Vulkan Graphics API.
The implementation part however focuses solely on DX for now, with plans to add Vulkan implementation support soon.
NV RTR RTX EP gets implementation details added and virtual base APIs are added in execution_provider.h header to have fallback path for other EPs.
- Fixed the Native Methods to include binding delegations
- Cleaned up GetExtSemaphore API to take better care of return values
- Made returns clearer while looping active EPs
- Fixed the pointer assignment and dereferencing in fallback path APIs
- Fixed the namespace in GetOrtFenceForGraphicsInterop API
- in InteropEpWait and InteropEPSignal, added return calls when stream is nullptr
- Adding a compile parameter use_dx_for_interop. When specified, it enables a macro DX_FOR_INTEROP that enables d3d12.h inclusion and all other DX specific logic for compilation
- Couple of small fixes
… and EpWait APIs

- Modified GetOrtFenceForGraphicsInterop API to use pointer to GraphicsInteropParams
- Added documentation for exposed APIs
- Updated error handling in EpSignal and EpWait APIs
…eropParams and addressing current review comments

- Adding Vulkan compilation support
- Adding more members to GraphicsInteropParams
- Addressing current review comments
Adding CIG support for DX - expected to work with TRT 1.3. Used the corresponding local fixes in TRT to verify the working of sample app that uses CIG+NV TRT RTX EP.
- Added a struct to store extSemFence and the corresponding selectedEp at GetExtSemaphore API
- Not looping for EPs in InteropWait and InteropSignal calls, using the selected EP instead.
… made an opaque struct

- Bifurcated fence params from graphicsInteropParams
- made SemaphoreEpMap an opaque struct, defined in ORT but app does not call it. Instead it uses a void*
- Some additional code fixes called out by CodeRabbit
- Added an opaque OrtFence struct. Replaced void*/void** with OrtFence*/OrtFence** for fence related params.
- Replaced passing stream handle to EP with passing stream directly.
- Changed the name of SetupCigContextForEpDevice to SetupGraphicsInteropContextForEpDevice
- Made the datatypes of DX/Vulkan as void* and removed the corresponding DX/Vulkan headers in onnxruntime_c_api.h
- Moved the DX/Vulkan header inclusion into execution_provider.h
- Reinterpret casting into appropriate data types from void* where needed
Add a to-do regarding further changes needed in fallback code
Not using pVkSempahore as a pointer to VkSemaphore but the semaphore handle directly, hence renaming the variable accordlingly
…RT RTX EP device.

Validated to be working as expected when interop APIs are setup and outputs garbage when interop APIs are commented out.
@@ -0,0 +1,438 @@
// SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Check warning

Code scanning / lintrunner

CLANGFORMAT/format Warning test

See https://clang.llvm.org/docs/ClangFormat.html.
Run lintrunner -a to apply this patch.
Comment on lines +533 to +546
typedef struct FenceInteropParams {
ExternalSyncPrimitive extSyncPrimitive;
union FencePtr {
void* pFence;
void* pVkFence;
void* VkSemaphore;
} FencePtr;
VulkanDeviceParams VulkanDeviceParams;
} FenceInteropParams;

typedef struct SemaphoreEpMap {
void* extSemFence;
void* selectedEp;
} SemaphoreEpMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage of 'fence' vs 'semaphore' in a number of places is confusing/inconsistent. Would be good to explain what exactly is used for DX vs VK and why. I expected these changes were only for device to device synchronization yet based on the VK doco a vkFence is for CPU to GPU sync so it's not clear to me why we need that.

We have a SemaphoreEpMap with an extSemFence which I assume could be either but the type name doesn't indicate that.
Conflicting with that is FenceInteropParams that contains fences or a semaphore.


/** \brief Setup Graphics Interopcontext for an execution provider device.
*
* This function enables D3D12/Vulkan interoperability by creating a CUDA context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything in the ORT API needs to be EP agnostic. I think it's ok for the documentation to give an example that's CUDA specific, but the general documentation on what the function does should be generic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comments and API to be EP agnostic.

public:
virtual ~IExecutionProvider() = default;

virtual Status GetExtSemaphore(const struct GraphicsInteropParams* graphicsInteropParams, struct FenceInteropParams* fenceInteropParams, void** extSemFence) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IExecutionProvider is an internal type. As such I wouldn't expect it's API to be using things like void** extSemFence as conversion from opaque API types to the 'real' internal types should be handled by the PluginExecutionProvider class which implements IExecutionProvider for plugin EPs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I will modify the code in the next commit based on this suggestion.

Comment on lines 6649 to 6650
ORT_API2_STATUS(SetupGraphicsInteropContextForEpDevice, _In_ const OrtEpDevice* ep_device,
_In_ const struct GraphicsInteropParams* graphicsInteropParams);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively I would have expected this to be setup on a per-Session basis. An OrtEpDevice is an Environment level type so used for all sessions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to SetupGraphicsInteropForEpDevice to better reflect its usage. The idea is that this API must be called at the beginning in order to setup graphics interop on all the sessions associated with an OrtEpDevice correctly.

*
* \since Version 1.24.
*/
ORT_API2_STATUS(SetupCigContext, _In_ OrtEpFactory* this_ptr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality needs to be made generic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the API name to SetupGraphicsContext to make it more generic. The functionality is generic already, EPs are expected to have their implementation in EP code files. In case of NV TRT RTX EP, SetupGraphicsContext = SetupCigContextImpl. This assignment and implementation of SetupCigContextImpl is present in onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc

*
* \since Version 1.24
*/
ORT_API2_STATUS(InteropEpSignal, _In_ OrtSession* session, _In_ OrtFence* ortFence, _In_ OrtSyncStream* stream, _In_ uint64_t fenceValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the naming of InteropEpWait/InteropEpSignal as the usage of EP/s within a session is somewhat opaque.

I would have expected the wait/signal to be around a call to OrtSession Run, with ORT figuring out when/where an EP is connected to a fence/semaphore for input/output. Should the wait/signal instead by coupled to the Run? The current setup also means you could not have concurrent calls to Run which feels like it could be overly restrictive.

Copy link
Contributor Author

@praneshgo praneshgo Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the naming of InteropEpWait/InteropEpSignal as the usage of EP/s within a session is somewhat opaque.

InteropEpWait and InteropEpSignal are designed as generic synchronization APIs that users can invoke on demand. They operate independently of the session context. To make this explicit and remove ambiguity, I have updated both API signatures to remove the OrtSession parameter.

The current setup also means you could not have concurrent calls to Run which feels like it could be overly restrictive.

Can you kindly provide an example of this scenario, so I may better understand?

*
* \since Version 1.24
*/
ORT_API2_STATUS(GetOrtFenceForGraphicsInterop, _In_ OrtSession* session, _In_ const struct GraphicsInteropParams* graphicsInteropParams, _In_ struct FenceInteropParams* fenceInteropParams, _Outptr_ OrtFence** ortFence);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who owns ortFence and how/when should it be released?

Comments seem to be out of date.

Why do fenceInteropParams need to be non-const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Application owns ortFence and it is responsible for releasing it when done.
Updated comments in the latest commit.
fenceInteropParams is made const struct.


Status NvExecutionProvider::GetExtSemaphore(const struct GraphicsInteropParams* graphicsInteropParams, struct FenceInteropParams* fenceInteropParams, void** extSemFence)
{
if (!info_.has_user_compute_stream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has_user_compute_stream is very specific to this EP. how will this work for an EP that doesn't currently support user provided streams? would they create a stream using the ORT API and pass it in somehow (e.g. in the params to one of the new API calls)? we don't really want to manually have to add an option to all the EPs for this sort of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be no way today to associate an external OrtSyncStream with a session in an EP-agnostic way—something like sessionOptions.setOrtSyncStream. If there is an option to do that, we can certainly avoid passing EP-specific options like has_user_compute_stream.

…e generic

- Renaming APIs to make them seem more generic
- Made fenceInteropParams a const struct in GetExtSemaphore
- Modified SetupGraphicsInteropContextForEpDevice to be called SetupGraphicsInteropForEpDevice, Removed occurrences of context in all the child functions it invokes.
- Removed session paramter from InteropEpSignal and InteropEpWait as it is a generic API and it is not dependent on a session
@skottmckay
Copy link
Contributor

skottmckay commented Dec 18, 2025

Would be great to get your initial thoughts on #26821. I think that's more consistent in approach to other things the plugin EP is providing, as well as with the standard Create/Release setup the ORT API uses. The proposed approach also includes memory handling which other IHVs have asked for. I think there are some gaps, but that structure feels a lot cleaner to me with more explicit usage.

I have a lot of comments about the PR having spent the last day or so going through it, but if we want to move to something closer to that setup quite a few will be invalidated.

Some apply here as well though

  • should the OrtSyncStream to use for an inference be passed in as part of the call to Run so that there's potentially more than one stream that can be used for different inferences/sessions
  • would be better to get the OrtEpDevice for an input from SessionGetEpDeviceForInputs so you know for sure that the EP will be consuming the input.
  • all structs need to be versioned

Are the Vulkan changes here tested? I didn't see anything Vulkan specific in the unit test.

@skottmckay
Copy link
Contributor

Can you take a look at the PRs Nick has added with initial implementation to see if there are any gaps in what that provides?

#26828
#26829

@praneshgo
Copy link
Contributor Author

Can you take a look at the PRs Nick has added with initial implementation to see if there are any gaps in what that provides?

#26828 #26829

Hi Scott
Sure. I will take a look at these PRs and provide feedback. Thanks.

@skottmckay
Copy link
Contributor

Can you take a look at the PRs Nick has added with initial implementation to see if there are any gaps in what that provides?
#26828 #26829

Hi Scott Sure. I will take a look at these PRs and provide feedback. Thanks.

Important that the setup works for Level Zero as well, which I don't believe has an equivalent to cuCtxCreate_v4 (please correct me if I'm wrong).

Possibly we could support using external semaphores as the general-purpose usage and maybe have an 'attach to device' option to enable using cuCtxCreate_v4 as an additional optimization. If you attached to the device first, would the pattern of calling 'import semaphore' followed by wait/signal with an OrtSyncStream and the result of the import still be viable? Possibly the implementation would use different types if attached to device, but if the API is generic enough that should all be opaque to ORT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants